
Feature/validator.ValidateMulti() #405

Draft
freemans13 wants to merge 16 commits into bsv-blockchain:main from freemans13:feature/validator-validate-multiple

Conversation

@freemans13
Collaborator

No description provided.

icellan and others added 4 commits January 15, 2026 17:22
Fixes chain validation failure when syncing more than 10,000 blocks.

The bug occurred because GetBlockHeadersFromOldest includes the starting
block (uses >= in SQL), causing duplicate headers at iteration boundaries.
When catchup spans multiple iterations (every 10,000 blocks), the last
header from iteration N becomes the first header in iteration N+1,
creating a duplicate that breaks chain validation.

Changes:
- Skip first header in iterations 2+ if it matches the last header from
  the previous iteration
- Handle edge case where all headers are duplicates (chain tip reached)
- Add debug logging for duplicate detection

Tests:
- Add comprehensive multi-iteration tests covering 10K, 25K header scenarios
- Test edge cases (single duplicate header, exact boundary conditions)
- Verify chain continuity across iteration boundaries
- Validate header chain cache builds correctly with multi-iteration headers

All existing tests pass.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
@freemans13 freemans13 self-assigned this Jan 15, 2026

github-actions bot commented Jan 15, 2026

🤖 Claude Code Review

Status: Complete


Summary:

This PR introduces significant architectural changes: a custom worker pool for subtree validation and concurrent catchup sessions with sequential validation ordering. The implementation is generally sound, but several areas need attention.

Key Issues:

  1. Worker pool error handling - Returns first error only, losing visibility into multiple validation failures. Consider collecting all errors for better debugging.

  2. Aggressive worker count (64x CPU cores) - Defaults to 512 workers on 8-core systems without benchmark justification. The previous implementation used a different concurrency calculation; performance data is needed to validate this change.

  3. Semaphore test coverage - Missing tests for critical concurrent validation behavior (blocking, error paths, cancellation while waiting).

  4. Aerospike config changes - 4x-5x increases to cache sizes and delays appear unrelated to PR title. Should be separated or justified.

  5. Worker pool cleanup methods - Both Close() and Shutdown() exist but only one is used. Needs documentation or consolidation.

See inline comments for details. The core concurrent catchup session management looks well-designed with proper session tracking and metrics.

@freemans13 freemans13 changed the title from Feature/validator validate multiple to Feature/validator.ValidateMulti() on Jan 16, 2026
for _, result := range pool.results {
if result.err != nil {
// Do not wrap the error again, the error is already wrapped
return result.err

Critical: Early return on first error loses other validation errors

The current implementation returns immediately on the first subtree validation error found. This means that if multiple subtrees have validation errors, only the first one discovered (which could be any of them, given concurrent execution) will be reported.

This makes debugging harder because developers won't see all the validation failures at once. Consider either:

  1. Collecting all errors and returning a combined error
  2. Documenting this "fail-fast" behavior explicitly in comments

The previous errgroup implementation had the same behavior, but it's worth considering if this is the desired outcome for block validation.

// Close closes the job channel and waits for all workers to finish
func (p *subtreeWorkerPool) Close() {
close(p.jobs)
p.wg.Wait()

Potential resource leak: Context cancellation without cleanup

The Shutdown() method cancels the context and closes the channel, but Close() just closes the channel without canceling. If a worker is blocked on p.ctx.Done() and the caller uses Close() instead of Shutdown(), the context will never be canceled.

However, I see this is only called in one place (Block.go:700) and it uses Close(). Since the parent context passed to the pool should handle cancellation via the tracing/request lifecycle, this may be acceptable.

Consider either:

  1. Documenting when to use Close() vs Shutdown()
  2. Removing Shutdown() if it's never used
  3. Always canceling the context in Close() for safety

// Subtree validation is ~97% I/O (file reads from blob store)
// High concurrency needed to saturate disk I/O throughput
// On 8-core machine: 512 workers
numWorkers := runtime.GOMAXPROCS(0) * 64

Extremely high default worker count: 64x CPU cores

This defaults to 512 workers on an 8-core machine. While the comment justifies this for I/O-heavy operations, this is an unusually aggressive multiplier that could cause issues:

  1. Each worker is a goroutine with its own stack (minimum 2KB, grows to 1GB)
  2. 512 concurrent blob store reads could overwhelm file descriptors, network connections, or disk I/O queues
  3. The comment mentions "~97% I/O" but provides no benchmark data to support the 64x multiplier

Consider:

  1. Adding benchmark results showing that 64x actually improves performance over lower values (8x, 16x, 32x)
  2. Making this configurable via settings with a more conservative default (e.g., 16x)
  3. Testing with actual production block sizes to ensure this doesn't cause resource exhaustion

The previous errgroup implementation used getValidationConcurrency() which appears to use a different calculation - was that measured to be insufficient?

// This blocks until the previous session's validation completes
u.logger.Infof("[catchup][%s][session:%s] waiting for validation semaphore", catchupCtx.blockUpTo.Hash().String(), catchupCtx.sessionID)
select {
case u.validationSemaphore <- struct{}{}:

Missing test coverage for validation semaphore blocking behavior

The validation semaphore is critical for ensuring sequential validation across concurrent catchup sessions, but I don't see tests that verify:

  1. Multiple sessions can download concurrently (as intended)
  2. Only one session can validate at a time (enforced by semaphore)
  3. Sessions properly release the semaphore even when validation fails
  4. Sessions handle context cancellation while waiting for the semaphore

The existing test TestCatchup_ConcurrentCatchupSessions only checks session registration/unregistration, not the actual semaphore blocking behavior. Consider adding integration tests that verify these concurrent behaviors.

# can be dangerous if the instance crashes or the storage can't keep up at all
# monitor the queue with `asadm -e "show statistics like write_q"`
-max-write-cache 1024M
+max-write-cache 4096M

Significant Aerospike tuning changes without justification

This PR changes several critical Aerospike performance parameters:

  • post-write-cache: 256 -> 1024 (4x increase)
  • defrag-sleep: 2000 -> 10000 (5x slower defragmentation)
  • flush-max-ms: 1000 -> 5000 (5x longer flush delay)
  • max-write-cache: 1024M -> 4096M (4x larger write buffer)

These changes significantly affect UTXO store write performance and memory usage. While the PR title mentions ValidateMulti(), the Aerospike tuning seems unrelated to that feature.

Questions:

  1. Are these changes necessary for the worker pool/concurrent catchup features?
  2. Were these settings benchmarked with production workloads?
  3. Should these be in a separate PR focused on storage optimization?
  4. What happens on systems with less available RAM when max-write-cache is 4GB?

@freemans13 freemans13 marked this pull request as draft January 19, 2026 18:30